IT departments often have targets like:
An application must be available 99% of the time
An application may be unavailable for at most 5 consecutive minutes
Customers require these service levels and sign Service Level Agreements (SLAs) with IT departments.
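The two example targets above are independent of each other, which a small worked example can show. The following is only a sketch: the `Outage` type, the function names, and the sample outages are hypothetical and not part of JBoss ON, and outages are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    start: float  # minutes since the start of the reporting period
    end: float    # minutes since the start of the reporting period


def availability_percent(outages, period_minutes):
    """Share of the period during which the application was up, in percent."""
    down = sum(o.end - o.start for o in outages)
    return 100.0 * (period_minutes - down) / period_minutes


def longest_outage_minutes(outages):
    """Length of the longest single outage in minutes (0 if there were none)."""
    return max((o.end - o.start for o in outages), default=0.0)


# Hypothetical data: one day (1440 minutes) with a 4-minute and a 6-minute outage.
outages = [Outage(100, 104), Outage(600, 606)]
meets_availability = availability_percent(outages, 1440) >= 99.0
meets_max_outage = longest_outage_minutes(outages) <= 5.0
```

With this data the 99% target is met (10 of 1440 minutes down), while the "at most 5 consecutive minutes" target is missed because of the 6-minute outage.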
JBoss ON should support IT departments in meeting these targets by:
Allowing SLAs to be defined per resource or application (-> Correlation Units)
Computing the current SLA achievement value
Computing the "how many abnormal conditions are still allowed" value
Alerting on missed SLAs
Computing SLA trends (-> Trends) in order to alert operations when an SLA breach is possible
Allowing scheduled downtimes to be entered into the system
In order to compute SLA values correctly, scheduled downtimes must be entered into the system so that they can be subtracted from the actual downtime values (or, alternatively, so that the availability for those timeframes can be set to e.g. blue, a.k.a. "down due to known downtime"). The latter can also be fed into the alerting subsystem in order to prevent sending alerts for resources that are down on purpose.
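Subtracting scheduled downtime, and deriving how much unplanned downtime is still allowed, could be sketched as follows. All function names are hypothetical, intervals are (start, end) tuples in minutes, and the sketch assumes that scheduled downtime is only subtracted from the downtime total and does not shorten the reporting period itself.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (0 if they do not overlap)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def unplanned_downtime(outages, scheduled):
    """Total outage minutes not covered by a scheduled downtime window.

    Both arguments are lists of (start, end) tuples in minutes; the windows
    within each list are assumed not to overlap each other.
    """
    total = 0.0
    for o_start, o_end in outages:
        covered = sum(overlap(o_start, o_end, s_start, s_end)
                      for s_start, s_end in scheduled)
        total += (o_end - o_start) - covered
    return total


def remaining_budget(period_minutes, target_percent, outages, scheduled):
    """Minutes of unplanned downtime still allowed before the SLA is missed."""
    allowed = period_minutes * (100.0 - target_percent) / 100.0
    return allowed - unplanned_downtime(outages, scheduled)
```

For a 30-day period (43200 minutes) and a 99% target, 432 minutes of downtime are allowed. With outages of 120 and 60 minutes, of which 100 minutes fall into a scheduled window, 80 minutes count as unplanned and 352 minutes remain. A negative result would mean the SLA has already been missed.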
Computation of the SLA values needs to be relatively instantaneous.
Question: do we need to temporarily decrease collection intervals for resources that are out of bounds? Suppose we are collecting a metric every 5 minutes. If the last value is out of bounds and is accounted as an SLA violation, and the situation goes back to normal after 1 minute, we would nevertheless take the whole 5-minute range into account. Decreasing the interval to 1 minute would account for at most 2 minutes (if we "just" missed the return to normal). Once the situation is back to normal, we could restore the original 5-minute interval.
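The arithmetic behind this question can be made explicit with a small sketch. The names are hypothetical, the interval values come from the example above, and the accounting model is a pessimistic assumption: a violation is counted until the first sample taken strictly after the recovery.

```python
import math

NORMAL_INTERVAL = 5  # minutes, taken from the example in the text
FAST_INTERVAL = 1    # minutes, taken from the example in the text


def next_interval(last_value_in_bounds):
    """Poll faster while the last value was out of bounds, else restore normal."""
    return NORMAL_INTERVAL if last_value_in_bounds else FAST_INTERVAL


def accounted_minutes(recovery_after, interval):
    """Minutes counted as violation when the resource recovers
    `recovery_after` minutes after the out-of-bounds sample.

    Pessimistic model: the violation is counted until the first sample
    taken strictly after the recovery.
    """
    return (math.floor(recovery_after / interval) + 1) * interval
```

Under this model a recovery after 1 minute is accounted as 5 minutes at the normal interval (`accounted_minutes(1, 5)`) and as 2 minutes at the faster one (`accounted_minutes(1, 1)`), matching the numbers in the question.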